Source of the materials: Biopython cookbook (adapted) Status: Draft
Cookbook – Cool things to do with it¶
Biopython now has two collections of “cookbook” examples – this chapter (which has been included in this tutorial for many years and has gradually grown), and http://biopython.org/wiki/Category:Cookbook which is a user contributed collection on our wiki.
We’re trying to encourage Biopython users to contribute their own examples to the wiki. In addition to helping the community, one direct benefit of sharing an example like this is that you could also get some feedback on the code from other Biopython users and developers - which could help you improve all your Python code.
In the long term, we may end up moving all of the examples in this chapter to the wiki, or elsewhere within the tutorial.
Working with sequence files¶
This section shows some more examples of sequence input/output, using
the Bio.SeqIO
module described in Chapter [chapter:Bio.SeqIO].
Filtering a sequence file¶
Often you’ll have a large file with many sequences in it (e.g. FASTA file or genes, or a FASTQ or SFF file of reads), a separate shorter list of the IDs for a subset of sequences of interest, and want to make a new sequence file for this subset.
Let’s say the list of IDs is in a simple text file, as the first word on each line. This could be a tabular file where the first column is the ID. Try something like this:
from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))
Note that we use a Python set
rather than a list
, this makes
testing membership faster.
Producing randomised genomes¶
Let’s suppose you are looking at genome sequence, hunting for some sequence feature – maybe extreme local GC% bias, or possible restriction digest sites. Once you’ve got your Python code working on the real genome it may be sensible to try running the same search on randomised versions of the same genome for statistical analysis (after all, any “features” you’ve found could just be there just by chance).
For this discussion, we’ll use the GenBank file for the pPCP1 plasmid
from Yersinia pestis biovar Microtus. The file is included with the
Biopython unit tests under the GenBank folder, or you can get it from
our website,
`NC_005816.gb
<http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb>`__.
This file contains one and only one record, so we can read it in as a
SeqRecord
using the Bio.SeqIO.read()
function:
In [2]:
from Bio import SeqIO
original_rec = SeqIO.read("data/NC_005816.gb", "genbank")
So, how can we generate a shuffled versions of the original sequence? I
would use the built in Python random
module for this, in particular
the function random.shuffle
– but this works on a Python list. Our
sequence is a Seq
object, so in order to shuffle it we need to turn
it into a list:
In [3]:
import random
nuc_list = list(original_rec.seq)
random.shuffle(nuc_list) #acts in situ!
Now, in order to use Bio.SeqIO
to output the shuffled sequence, we
need to construct a new SeqRecord
with a new Seq
object using
this shuffled list. In order to do this, we need to turn the list of
nucleotides (single letter strings) into a long string – the standard
Python way to do this is with the string object’s join method.
In [4]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
id="Shuffled", description="Based on %s" % original_rec.id)
Let’s put all these pieces together to make a complete Python script which generates a single FASTA file containing 30 randomly shuffled versions of the original sequence.
This first version just uses a big for loop and writes out the records
one by one (using the SeqRecord
’s format method described in
Section [sec:Bio.SeqIO-and-StringIO]):
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
original_rec = SeqIO.read("NC_005816.gb","genbank")
handle = open("shuffled.fasta", "w")
for i in range(30):
nuc_list = list(original_rec.seq)
random.shuffle(nuc_list)
shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet), \
id="Shuffled%i" % (i+1), \
description="Based on %s" % original_rec.id)
handle.write(shuffled_rec.format("fasta"))
handle.close()
Personally I prefer the following version using a function to shuffle the record and a generator expression instead of the for loop:
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def make_shuffle_record(record, new_id):
nuc_list = list(record.seq)
random.shuffle(nuc_list)
return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet), \
id=new_id, description="Based on %s" % original_rec.id)
original_rec = SeqIO.read("NC_005816.gb","genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i+1)) \
for i in range(30))
handle = open("shuffled.fasta", "w")
SeqIO.write(shuffled_recs, handle, "fasta")
handle.close()
Translating a FASTA file of CDS entries¶
Suppose you’ve got an input file of CDS entries for some organism, and
you want to generate a new FASTA file containing their protein
sequences. i.e. Take each nucleotide sequence from the original file,
and translate it. Back in Section [sec:translation] we saw how to use
the Seq
object’s translate method
, and the optional cds
argument which enables correct translation of alternative start codons.
We can combine this with Bio.SeqIO
as shown in the reverse
complement example in Section [sec:SeqIO-reverse-complement]. The key
point is that for each nucleotide SeqRecord
, we need to create a
protein SeqRecord
- and take care of naming it.
You can write you own function to do this, choosing suitable protein identifiers for your sequences, and the appropriate genetic code. In this example we just use the default table and add a prefix to the identifier:
from Bio.SeqRecord import SeqRecord
def make_protein_record(nuc_record):
"""Returns a new SeqRecord with the translated sequence (default table)."""
return SeqRecord(seq = nuc_record.seq.translate(cds=True), \
id = "trans_" + nuc_record.id, \
description = "translation of CDS, using default table")
We can then use this function to turn the input nucleotide records into protein records ready for output. An elegant way and memory efficient way to do this is with a generator expression:
from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in \
SeqIO.parse("coding_sequences.fasta", "fasta"))
SeqIO.write(proteins, "translations.fasta", "fasta")
This should work on any FASTA file of complete coding sequences. If you
are working on partial coding sequences, you may prefer to use
nuc_record.seq.translate(to_stop=True)
in the example above, as this
wouldn’t check for a valid start codon etc.
Making the sequences in a FASTA file upper case¶
Often you’ll get data from collaborators as FASTA files, and sometimes
the sequences can be in a mixture of upper and lower case. In some cases
this is deliberate (e.g. lower case for poor quality regions), but
usually it is not important. You may want to edit the file to make
everything consistent (e.g. all upper case), and you can do this easily
using the upper()
method of the SeqRecord
object (added in
Biopython 1.55):
from Bio import SeqIO
records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta"))
count = SeqIO.write(records, "upper.fas", "fasta")
print("Converted %i records to upper case" % count)
How does this work? The first line is just importing the Bio.SeqIO
module. The second line is the interesting bit – this is a Python
generator expression which gives an upper case version of each record
parsed from the input file (mixed.fas
). In the third line we give
this generator expression to the Bio.SeqIO.write()
function and it
saves the new upper cases records to our output file (upper.fas
).
The reason we use a generator expression (rather than a list or list comprehension) is this means only one record is kept in memory at a time. This can be really important if you are dealing with large files with millions of entries.
Sorting a sequence file¶
Suppose you wanted to sort a sequence file by length (e.g. a set of
contigs from an assembly), and you are working with a file format like
FASTA or FASTQ which Bio.SeqIO
can read, write (and index).
If the file is small enough, you can load it all into memory at once as
a list of SeqRecord
objects, sort the list, and save it:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(x),len(y)))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")
The only clever bit is specifying a comparison function for how to sort the records (here we sort them by length). If you wanted the longest records first, you could flip the comparison or use the reverse argument:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(y),len(x)))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")
Now that’s pretty straight forward - but what happens if you have a very
large file and you can’t load it all into memory like this? For example,
you might have some next-generation sequencing reads to sort by length.
This can be solved using the Bio.SeqIO.index()
function.
from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
SeqIO.parse("ls_orchid.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")
First we scan through the file once using Bio.SeqIO.parse()
,
recording the record identifiers and their lengths in a list of tuples.
We then sort this list to get them in length order, and discard the
lengths. Using this sorted list of identifiers Bio.SeqIO.index()
allows us to retrieve the records one by one, and we pass them to
Bio.SeqIO.write()
for output.
These examples all use Bio.SeqIO
to parse the records into
SeqRecord
objects which are output using Bio.SeqIO.write()
. What
if you want to sort a file format which Bio.SeqIO.write()
doesn’t
support, like the plain text SwissProt format? Here is an alternative
solution using the get_raw()
method added to Bio.SeqIO.index()
in Biopython 1.54 (see Section [sec:seqio-index-getraw]).
from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
SeqIO.parse("ls_orchid.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
handle = open("sorted.fasta", "w")
for id in ids:
handle.write(record_index.get_raw(id))
handle.close()
As a bonus, because it doesn’t parse the data into SeqRecord
objects
a second time it should be faster.
Simple quality filtering for FASTQ files¶
The FASTQ file format was introduced at Sanger and is now widely used
for holding nucleotide sequencing reads together with their quality
scores. FASTQ files (and the related QUAL files) are an excellent
example of per-letter-annotation, because for each nucleotide in the
sequence there is an associated quality score. Any per-letter-annotation
is held in a SeqRecord
in the letter_annotations
dictionary as a
list, tuple or string (with the same number of elements as the sequence
length).
One common task is taking a large set of sequencing reads and filtering
them (or cropping them) based on their quality scores. The following
example is very simplistic, but should illustrate the basics of working
with quality data in a SeqRecord
object. All we are going to do here
is read in a file of FASTQ data, and filter it to pick out only those
records whose PHRED quality scores are all above some threshold (here
20).
For this example we’ll use some real data downloaded from the ENA
sequence read archive,
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz
(2MB) which unzips to a 19MB file SRR020192.fastq
. This is some
Roche 454 GS FLX single end data from virus infected California sea
lions (see http://www.ebi.ac.uk/ena/data/view/SRS004476 for details).
First, let’s count the reads:
from Bio import SeqIO
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
count += 1
print("%i reads" % count)
Now let’s do a simple filtering for a minimum PHRED quality of 20:
from Bio import SeqIO
good_reads = (rec for rec in \
SeqIO.parse("SRR020192.fastq", "fastq") \
if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print("Saved %i reads" % count)
This pulled out only \(14580\) reads out of the \(41892\) present. A more sensible thing to do would be to quality trim the reads, but this is intended as an example only.
FASTQ files can contain millions of entries, so it is best to avoid
loading them all into memory at once. This example uses a generator
expression, which means only one SeqRecord
is created at a time -
avoiding any memory limitations.
Trimming off primer sequences¶
For this example we’re going to pretend that GATGACGGTGT
is a 5’
primer sequence we want to look for in some FASTQ formatted read data.
As in the example above, we’ll use the SRR020192.fastq
file
downloaded from the ENA
(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).
The same approach would work with any other supported file format (e.g.
FASTA files).
This code uses Bio.SeqIO
with a generator expression (to avoid
loading all the sequences into memory at once), and the Seq
object’s
startswith
method to see if the read starts with the primer
sequence:
from Bio import SeqIO
primer_reads = (rec for rec in \
SeqIO.parse("SRR020192.fastq", "fastq") \
if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq")
print("Saved %i reads" % count)
That should find \(13819\) reads from SRR014849.fastq
and save
them to a new FASTQ file, with_primer.fastq
.
Now suppose that instead you wanted to make a FASTQ file containing
these reads but with the primer sequence removed? That’s just a small
change as we can slice the SeqRecord
(see
Section [sec:SeqRecord-slicing]) to remove the first eleven letters (the
length of our primer):
from Bio import SeqIO
trimmed_primer_reads = (rec[11:] for rec in \
SeqIO.parse("SRR020192.fastq", "fastq") \
if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq")
print("Saved %i reads" % count)
Again, that should pull out the \(13819\) reads from
SRR020192.fastq
, but this time strip off the first ten characters,
and save them to another new FASTQ file, with_primer_trimmed.fastq
.
Finally, suppose you want to create a new FASTQ file where these reads have their primer removed, but all the other reads are kept as they were? If we want to still use a generator expression, it is probably clearest to define our own trim function:
from Bio import SeqIO
def trim_primer(record, primer):
if record.seq.startswith(primer):
return record[len(primer):]
else:
return record
trimmed_reads = (trim_primer(record, "GATGACGGTGT") for record in \
SeqIO.parse("SRR020192.fastq", "fastq"))
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
This takes longer, as this time the output file contains all \(41892\) reads. Again, we’re used a generator expression to avoid any memory problems. You could alternatively use a generator function rather than a generator expression.
from Bio import SeqIO
def trim_primers(records, primer):
"""Removes perfect primer sequences at start of reads.
This is a generator function, the records argument should
be a list or iterator returning SeqRecord objects.
"""
len_primer = len(primer) #cache this for later
for record in records:
if record.seq.startswith(primer):
yield record[len_primer:]
else:
yield record
original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
This form is more flexible if you want to do something more complicated where only some of the records are retained – as shown in the next example.
Trimming off adaptor sequences¶
This is essentially a simple extension to the previous example. We are
going to going to pretend GATGACGGTGT
is an adaptor sequence in some
FASTQ formatted read data, again the SRR020192.fastq
file from the
NCBI
(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).
This time however, we will look for the sequence anywhere in the reads, not just at the very beginning:
from Bio import SeqIO
def trim_adaptors(records, adaptor):
"""Trims perfect adaptor sequences.
This is a generator function, the records argument should
be a list or iterator returning SeqRecord objects.
"""
len_adaptor = len(adaptor) #cache this for later
for record in records:
index = record.seq.find(adaptor)
if index == -1:
#adaptor not found, so won't trim
yield record
else:
#trim off the adaptor
yield record[index+len_adaptor:]
original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
Because we are using a FASTQ input file in this example, the
SeqRecord
objects have per-letter-annotation for the quality scores.
By slicing the SeqRecord
object the appropriate scores are used on
the trimmed records, so we can output them as a FASTQ file too.
Compared to the output of the previous example where we only looked for a primer/adaptor at the start of each read, you may find some of the trimmed reads are quite short after trimming (e.g. if the adaptor was found in the middle rather than near the start). So, let’s add a minimum length requirement as well:
from Bio import SeqIO
def trim_adaptors(records, adaptor, min_len):
"""Trims perfect adaptor sequences, checks read length.
This is a generator function, the records argument should
be a list or iterator returning SeqRecord objects.
"""
len_adaptor = len(adaptor) #cache this for later
for record in records:
len_record = len(record) #cache this for later
if len(record) < min_len:
#Too short to keep
continue
index = record.seq.find(adaptor)
if index == -1:
#adaptor not found, so won't trim
yield record
elif len_record - index - len_adaptor >= min_len:
#after trimming this will still be long enough
yield record[index+len_adaptor:]
original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100)
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
By changing the format names, you could apply this to FASTA files instead. This code also could be extended to do a fuzzy match instead of an exact match (maybe using a pairwise alignment, or taking into account the read quality scores), but that will be much slower.
Converting FASTQ files¶
Back in Section [sec:SeqIO-conversion] we showed how to use
Bio.SeqIO
to convert between two file formats. Here we’ll go into a
little more detail regarding FASTQ files which are used in second
generation DNA sequencing. Please refer to Cock et al. (2009)
@cock2010 for a longer description. FASTQ files store both the DNA
sequence (as a string) and the associated read qualities.
PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files and SFF files) have become a de facto standard for representing the probability of a sequencing error (here denoted by \(P_e\)) at a given base using a simple base ten log transformation:
This means a wrong read (\(P_e = 1\)) gets a PHRED quality of \(0\), while a very good read like \(P_e = 0.00001\) gets a PHRED quality of \(50\). While for raw sequencing data qualities higher than this are rare, with post processing such as read mapping or assembly, qualities of up to about \(90\) are possible (indeed, the MAQ tool allows for PHRED scores in the range 0 to 93 inclusive).
The FASTQ format has the potential to become a de facto standard for storing the letters and quality scores for a sequencing read in a single plain text file. The only fly in the ointment is that there are at least three versions of the FASTQ format which are incompatible and difficult to distinguish...
- The original Sanger FASTQ format uses PHRED qualities encoded with an
ASCII offset of 33. The NCBI are using this format in their Short
Read Archive. We call this the
fastq
(orfastq-sanger
) format inBio.SeqIO
. - Solexa (later bought by Illumina) introduced their own version using
Solexa qualities encoded with an ASCII offset of 64. We call this the
fastq-solexa
format. - Illumina pipeline 1.3 onwards produces FASTQ files with PHRED
qualities (which is more consistent), but encoded with an ASCII
offset of 64. We call this the
fastq-illumina
format.
The Solexa quality scores are defined using a different log transformation:
Given Solexa/Illumina have now moved to using PHRED scores in version
1.3 of their pipeline, the Solexa quality scores will gradually fall out
of use. If you equate the error estimates (\(P_e\)) these two
equations allow conversion between the two scoring systems - and
Biopython includes functions to do this in the Bio.SeqIO.QualityIO
module, which are called if you use Bio.SeqIO
to convert an old
Solexa/Illumina file into a standard Sanger FASTQ file:
from Bio import SeqIO
SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq")
If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed is the ASCII offset because although encoded differently the scores are all PHRED qualities:
from Bio import SeqIO
SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq")
Note that using Bio.SeqIO.convert()
like this is much faster than
combining Bio.SeqIO.parse()
and Bio.SeqIO.write()
because
optimised code is used for converting between FASTQ variants (and also
for FASTQ to FASTA conversion).
For good quality reads, PHRED and Solexa scores are approximately equal,
which means since both the fasta-solexa
and fastq-illumina
formats use an ASCII offset of 64 the files are almost the same. This
was a deliberate design choice by Illumina, meaning applications
expecting the old fasta-solexa
style files will probably be OK using
the newer fastq-illumina
files (on good data). Of course, both
variants are very different from the original FASTQ standard as used by
Sanger, the NCBI, and elsewhere (format name fastq
or
fastq-sanger
).
For more details, see the built in help (also online):
In [5]:
from Bio.SeqIO import QualityIO
help(QualityIO)
Help on module Bio.SeqIO.QualityIO in Bio.SeqIO:
NAME
Bio.SeqIO.QualityIO - Bio.SeqIO support for the FASTQ and QUAL file formats.
DESCRIPTION
Note that you are expected to use this code via the Bio.SeqIO interface, as
shown below.
The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
are used containing the sequence and the quality information separately.
The PHRED software reads DNA sequencing trace files, calls bases, and
assigns a non-negative quality value to each called base using a logged
transformation of the error probability, Q = -10 log10( Pe ), for example::
Pe = 1.0, Q = 0
Pe = 0.1, Q = 10
Pe = 0.01, Q = 20
...
Pe = 0.00000001, Q = 80
Pe = 0.000000001, Q = 90
In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
In the QUAL format these quality values are held as space separated text in
a FASTA like file format. In the FASTQ format, each quality values is encoded
with a single ASCI character using chr(Q+33), meaning zero maps to the
character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
quality are then stored in pairs in a FASTA like format.
Unfortunately there is no official document describing the FASTQ file format,
and worse, several related but different variants exist. For more details,
please read this open access publication::
The Sanger FASTQ file format for sequences with quality scores, and the
Solexa/Illumina FASTQ variants.
P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
Nucleic Acids Research 2010 38(6):1767-1771
http://dx.doi.org/10.1093/nar/gkp1137
The good news is that Roche 454 sequencers can output files in the QUAL format,
and sensibly they use PHREP style scores like Sanger. Converting a pair of
FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
files from a Roche 454 SFF binary file, use the Roche off instrument command
line tool "sffinfo" with the -q or -qual argument. You can extract a matching
FASTA file using the -s or -seq argument instead.
The bad news is that Solexa/Illumina did things differently - they have their
own scoring system AND their own incompatible versions of the FASTQ format.
Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
be negative. PHRED scores and Solexa scores are NOT interchangeable (but a
reasonable mapping can be achieved between them, and they are approximately
equal for higher quality reads).
Confusingly early Solexa pipelines produced a FASTQ like file but using their
own score mapping and an ASCII offset of 64. To make things worse, for the
Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
FASTQ file format, this time using PHRED scores (which is more consistent) but
with an ASCII offset of 64.
i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
The good news is that as of CASAVA version 1.8, Illumina sequencers will
produce FASTQ files using the standard Sanger encoding.
You are expected to use this module via the Bio.SeqIO functions, with the
following format names:
- "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
- "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
These can potentially hold PHRED scores from 0 to 93.
- "fastq-sanger" is an alias for "fastq".
- "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
files, using Solexa scores with an ASCII offset 64. These can hold Solexa
scores from -5 to 62.
- "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
to 62.
We could potentially add support for "qual-solexa" meaning QUAL files which
contain Solexa scores, but thus far there isn't any reason to use such files.
For example, consider the following short FASTQ file::
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
This contains three reads of length 25. From the read length these were
probably originally from an early Solexa/Illumina sequencer but this file
follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
offet of 33). This means we can parse this file using Bio.SeqIO using
"fastq" as the format name:
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
The qualities are held as a list of integers in each record's annotation:
>>> print(record)
ID: EAS54_6_R1_2_1_443_348
Name: EAS54_6_R1_2_1_443_348
Description: EAS54_6_R1_2_1_443_348
Number of features: 0
Per letter annotation for: phred_quality
Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
You can use the SeqRecord format method to show this in the QUAL format:
>>> print(record.format("qual"))
>EAS54_6_R1_2_1_443_348
26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
24 18 18 18 18
<BLANKLINE>
Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
>>> print(record.format("fastq"))
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
<BLANKLINE>
Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
of 64):
>>> print(record.format("fastq-illumina"))
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
<BLANKLINE>
You can also get Biopython to convert the scores and show a Solexa style
FASTQ file:
>>> print(record.format("fastq-solexa"))
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
<BLANKLINE>
Notice that this is actually the same output as above using "fastq-illumina"
as the format! The reason for this is all these scores are high enough that
the PHRED and Solexa scores are almost equal. The differences become apparent
for poor quality reads. See the functions solexa_quality_from_phred and
phred_quality_from_solexa for more details.
If you wanted to trim your sequences (perhaps to remove low quality regions,
or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
>>> sub_rec = record[5:15]
>>> print(sub_rec)
ID: EAS54_6_R1_2_1_443_348
Name: EAS54_6_R1_2_1_443_348
Description: EAS54_6_R1_2_1_443_348
Number of features: 0
Per letter annotation for: phred_quality
Seq('TTCTGGCGTG', SingleLetterAlphabet())
>>> print(sub_rec.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
>>> print(sub_rec.format("fastq"))
@EAS54_6_R1_2_1_443_348
TTCTGGCGTG
+
;;;;;;9;7;
<BLANKLINE>
If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
>>> from Bio import SeqIO
>>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
>>> with open("Quality/temp.qual", "w") as out_handle:
... SeqIO.write(record_iterator, out_handle, "qual")
3
You can of course read in a QUAL file, such as the one we just created:
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 ?????????????????????????
EAS54_6_R1_2_1_540_792 ?????????????????????????
EAS54_6_R1_2_1_443_348 ?????????????????????????
Notice that QUAL files don't have a proper sequence present! But the quality
information is there:
>>> print(record)
ID: EAS54_6_R1_2_1_443_348
Name: EAS54_6_R1_2_1_443_348
Description: EAS54_6_R1_2_1_443_348
Number of features: 0
Per letter annotation for: phred_quality
UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
Just to keep things tidy, if you are following this example yourself, you can
delete this temporary file now:
>>> import os
>>> os.remove("Quality/temp.qual")
Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
files. Because the Bio.SeqIO system is designed for reading single files, you
would have to read the two in separately and then combine the data. However,
since this is such a common thing to want to do, there is a helper iterator
defined in this module that does this for you - PairedFastaQualIterator.
Alternatively, if you have enough RAM to hold all the records in memory at once,
then a simple dictionary approach would work:
>>> from Bio import SeqIO
>>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta"))
>>> for rec in SeqIO.parse("Quality/example.qual", "qual"):
... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
You can then access any record by its key, and get both the sequence and the
quality scores.
>>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq"))
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
<BLANKLINE>
It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
"fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
for the more recent variant), as this cannot be detected reliably
automatically.
To illustrate this problem, let's consider an artifical example:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> from Bio.SeqRecord import SeqRecord
>>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
... description="Made up!")
>>> print(test.format("fasta"))
>Test Made up!
NACGTACGTA
<BLANKLINE>
>>> print(test.format("fastq"))
Traceback (most recent call last):
...
ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
or FASTQ format we need to provide some quality scores. These are held as a
list of integers (one for each base) in the letter_annotations dictionary:
>>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
>>> print(test.format("qual"))
>Test Made up!
0 1 2 3 4 5 10 20 30 40
<BLANKLINE>
>>> print(test.format("fastq"))
@Test Made up!
NACGTACGTA
+
!"#$%&+5?I
<BLANKLINE>
We can check this FASTQ encoding - the first PHRED quality was zero, and this
mapped to a exclamation mark, while the final score was 40 and this mapped to
the letter "I":
>>> ord('!') - 33
0
>>> ord('I') - 33
40
>>> [ord(letter)-33 for letter in '!"#$%&+5?I']
[0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
scores with an offset of 64:
>>> print(test.format("fastq-illumina"))
@Test Made up!
NACGTACGTA
+
@ABCDEJT^h
<BLANKLINE>
And we can check this too - the first PHRED score was zero, and this mapped to
"@", while the final score was 40 and this mapped to "h":
>>> ord("@") - 64
0
>>> ord("h") - 64
40
>>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
[0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
FASTQ files look for the same data! Then we have the older Solexa/Illumina
format to consider which encodes Solexa scores instead of PHRED scores.
First let's see what Biopython says if we convert the PHRED scores into Solexa
scores (rounding to one decimal place):
>>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]:
... print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)))
PHRED 0 maps to Solexa -5.0
PHRED 1 maps to Solexa -5.0
PHRED 2 maps to Solexa -2.3
PHRED 3 maps to Solexa -0.0
PHRED 4 maps to Solexa 1.8
PHRED 5 maps to Solexa 3.3
PHRED 10 maps to Solexa 9.5
PHRED 20 maps to Solexa 20.0
PHRED 30 maps to Solexa 30.0
PHRED 40 maps to Solexa 40.0
Now here is the record using the old Solexa style FASTQ file:
>>> print(test.format("fastq-solexa"))
@Test Made up!
NACGTACGTA
+
;;>@BCJT^h
<BLANKLINE>
Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
>>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
[-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
This explains why the last few letters of this FASTQ output matched that using
the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
are approximately equal.
CLASSES
Bio.SeqIO.Interfaces.SequentialSequenceWriter(Bio.SeqIO.Interfaces.SequenceWriter)
FastqIlluminaWriter
FastqPhredWriter
FastqSolexaWriter
QualPhredWriter
class FastqIlluminaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
| Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
|
| This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
| using PHRED scores and an ASCII offset of 64. Note these files are NOT
| compatible with the standard Sanger style PHRED FASTQ files which use an
| ASCII offset of 32.
|
| Although you can use this class directly, you are strongly encouraged to
| use the Bio.SeqIO.write() function with format name "fastq-illumina"
| instead. This code is also called if you use the .format("fastq-illumina")
| method of a SeqRecord. For example,
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
| >>> print(record.format("fastq-illumina"))
| @Test PHRED qualities from 40 to 0 inclusive
| ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
| +
| hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
| <BLANKLINE>
|
| Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
| encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
| warning is issued.
|
| Method resolution order:
| FastqIlluminaWriter
| Bio.SeqIO.Interfaces.SequentialSequenceWriter
| Bio.SeqIO.Interfaces.SequenceWriter
| builtins.object
|
| Methods defined here:
|
| write_record(self, record)
| Write a single FASTQ record to the file.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
|
| __init__(self, handle)
| Creates the writer object.
|
| Use the method write_file() to actually record your sequence records.
|
| write_file(self, records)
| Use this to write an entire file containing the given records.
|
| records - A list or iterator returning SeqRecord objects
|
| This method can only be called once. Returns the number of records
| written.
|
| write_footer(self)
|
| write_header(self)
|
| write_records(self, records)
| Write multiple record to the output file.
|
| records - A list or iterator returning SeqRecord objects
|
| Once you have called write_header() you can call write_record()
| and/or write_records() as many times as needed. Then call
| write_footer() and close().
|
| Returns the number of records written.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| clean(self, text)
| Use this to avoid getting newlines in the output.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
class FastqPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
| Class to write standard FASTQ format files (using PHRED quality scores).
|
| Although you can use this class directly, you are strongly encouraged
| to use the Bio.SeqIO.write() function instead via the format name "fastq"
| or the alias "fastq-sanger". For example, this code reads in a standard
| Sanger style FASTQ file (using PHRED scores) and re-saves it as another
| Sanger style FASTQ file:
|
| >>> from Bio import SeqIO
| >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
| >>> with open("Quality/temp.fastq", "w") as out_handle:
| ... SeqIO.write(record_iterator, out_handle, "fastq")
| 3
|
| You might want to do this if the original file included extra line breaks,
| which while valid may not be supported by all tools. The output file from
| Biopython will have each sequence on a single line, and each quality
| string on a single line (which is considered desirable for maximum
| compatibility).
|
| In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
| quality scores) is converted into a standard Sanger style FASTQ file using
| PHRED qualities:
|
| >>> from Bio import SeqIO
| >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
| >>> with open("Quality/temp.fastq", "w") as out_handle:
| ... SeqIO.write(record_iterator, out_handle, "fastq")
| 5
|
| This code is also called if you use the .format("fastq") method of a
| SeqRecord, or .format("fastq-sanger") if you prefer that alias.
|
| Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
| encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
| warning is issued.
|
| P.S. To avoid cluttering up your working directory, you can delete this
| temporary file now:
|
| >>> import os
| >>> os.remove("Quality/temp.fastq")
|
| Method resolution order:
| FastqPhredWriter
| Bio.SeqIO.Interfaces.SequentialSequenceWriter
| Bio.SeqIO.Interfaces.SequenceWriter
| builtins.object
|
| Methods defined here:
|
| write_record(self, record)
| Write a single FASTQ record to the file.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
|
| __init__(self, handle)
| Creates the writer object.
|
| Use the method write_file() to actually record your sequence records.
|
| write_file(self, records)
| Use this to write an entire file containing the given records.
|
| records - A list or iterator returning SeqRecord objects
|
| This method can only be called once. Returns the number of records
| written.
|
| write_footer(self)
|
| write_header(self)
|
| write_records(self, records)
| Write multiple record to the output file.
|
| records - A list or iterator returning SeqRecord objects
|
| Once you have called write_header() you can call write_record()
| and/or write_records() as many times as needed. Then call
| write_footer() and close().
|
| Returns the number of records written.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| clean(self, text)
| Use this to avoid getting newlines in the output.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
class FastqSolexaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
| Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
|
| This outputs FASTQ files like those from the early Solexa/Illumina
| pipeline, using Solexa scores and an ASCII offset of 64. These are
| NOT compatible with the standard Sanger style PHRED FASTQ files.
|
| If your records contain a "solexa_quality" entry under letter_annotations,
| this is used, otherwise any "phred_quality" entry will be used after
| conversion using the solexa_quality_from_phred function. If neither style
| of quality scores are present, an exception is raised.
|
| Although you can use this class directly, you are strongly encouraged
| to use the Bio.SeqIO.write() function instead. For example, this code
| reads in a FASTQ file and re-saves it as another FASTQ file:
|
| >>> from Bio import SeqIO
| >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
| >>> with open("Quality/temp.fastq", "w") as out_handle:
| ... SeqIO.write(record_iterator, out_handle, "fastq-solexa")
| 5
|
| You might want to do this if the original file included extra line breaks,
| which (while valid) may not be supported by all tools. The output file
| from Biopython will have each sequence on a single line, and each quality
| string on a single line (which is considered desirable for maximum
| compatibility).
|
| This code is also called if you use the .format("fastq-solexa") method of
| a SeqRecord. For example,
|
| >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
| >>> print(record.format("fastq-solexa"))
| @Test PHRED qualities from 40 to 0 inclusive
| ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
| +
| hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
| <BLANKLINE>
|
| Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
| encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
| a warning is issued.
|
| P.S. Don't forget to delete the temp file if you don't need it anymore:
|
| >>> import os
| >>> os.remove("Quality/temp.fastq")
|
| Method resolution order:
| FastqSolexaWriter
| Bio.SeqIO.Interfaces.SequentialSequenceWriter
| Bio.SeqIO.Interfaces.SequenceWriter
| builtins.object
|
| Methods defined here:
|
| write_record(self, record)
| Write a single FASTQ record to the file.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
|
| __init__(self, handle)
| Creates the writer object.
|
| Use the method write_file() to actually record your sequence records.
|
| write_file(self, records)
| Use this to write an entire file containing the given records.
|
| records - A list or iterator returning SeqRecord objects
|
| This method can only be called once. Returns the number of records
| written.
|
| write_footer(self)
|
| write_header(self)
|
| write_records(self, records)
| Write multiple record to the output file.
|
| records - A list or iterator returning SeqRecord objects
|
| Once you have called write_header() you can call write_record()
| and/or write_records() as many times as needed. Then call
| write_footer() and close().
|
| Returns the number of records written.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| clean(self, text)
| Use this to avoid getting newlines in the output.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
class QualPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
| Class to write QUAL format files (using PHRED quality scores).
|
| Although you can use this class directly, you are strongly encouraged
| to use the Bio.SeqIO.write() function instead. For example, this code
| reads in a FASTQ file and saves the quality scores into a QUAL file:
|
| >>> from Bio import SeqIO
| >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
| >>> with open("Quality/temp.qual", "w") as out_handle:
| ... SeqIO.write(record_iterator, out_handle, "qual")
| 3
|
| This code is also called if you use the .format("qual") method of a
| SeqRecord.
|
| P.S. Don't forget to clean up the temp file if you don't need it anymore:
|
| >>> import os
| >>> os.remove("Quality/temp.qual")
|
| Method resolution order:
| QualPhredWriter
| Bio.SeqIO.Interfaces.SequentialSequenceWriter
| Bio.SeqIO.Interfaces.SequenceWriter
| builtins.object
|
| Methods defined here:
|
| __init__(self, handle, wrap=60, record2title=None)
| Create a QUAL writer.
|
| Arguments:
| - handle - Handle to an output file, e.g. as returned
| by open(filename, "w")
| - wrap - Optional line length used to wrap sequence lines.
| Defaults to wrapping the sequence at 60 characters
| Use zero (or None) for no wrapping, giving a single
| long line for the sequence.
| - record2title - Optional function to return the text to be
| used for the title line of each record. By default
| a combination of the record.id and record.description
| is used. If the record.description starts with the
| record.id, then just the record.description is used.
|
| The record2title argument is present for consistency with the
| Bio.SeqIO.FastaIO writer class.
|
| write_record(self, record)
| Write a single QUAL record to the file.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
|
| write_file(self, records)
| Use this to write an entire file containing the given records.
|
| records - A list or iterator returning SeqRecord objects
|
| This method can only be called once. Returns the number of records
| written.
|
| write_footer(self)
|
| write_header(self)
|
| write_records(self, records)
| Write multiple record to the output file.
|
| records - A list or iterator returning SeqRecord objects
|
| Once you have called write_header() you can call write_record()
| and/or write_records() as many times as needed. Then call
| write_footer() and close().
|
| Returns the number of records written.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| clean(self, text)
| Use this to avoid getting newlines in the output.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
FUNCTIONS
FastqGeneralIterator(handle)
Iterate over Fastq records as string tuples (not as SeqRecord objects).
This code does not try to interpret the quality string numerically. It
just returns tuples of the title, sequence and quality as strings. For
the sequence and quality, any whitespace (such as new lines) is removed.
Our SeqRecord based FASTQ iterators call this function internally, and then
turn the strings into a SeqRecord objects, mapping the quality string into
a list of numerical scores. If you want to do a custom quality mapping,
then you might consider calling this function directly.
For parsing FASTQ files, the title string from the "@" line at the start
of each record can optionally be omitted on the "+" lines. If it is
repeated, it must be identical.
The sequence string and the quality string can optionally be split over
multiple lines, although several sources discourage this. In comparison,
for the FASTA file format line breaks between 60 and 80 characters are
the norm.
**WARNING** - Because the "@" character can appear in the quality string,
this can cause problems as this is also the marker for the start of
a new sequence. In fact, the "+" sign can also appear as well. Some
sources recommended having no line breaks in the quality to avoid this,
but even that is not enough, consider this example::
@071113_EAS56_0053:1:1:998:236
TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
+071113_EAS56_0053:1:1:998:236
IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
@071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
+
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
@071113_EAS56_0053:1:1:153:10
TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
+
IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
@071113_EAS56_0053:1:3:990:501
TGGGAGGTTTTATGTGGA
AAGCAGCAATGTACAAGA
+
IIIIIII.IIIIII1@44
@-7.%<&+/$/%4(++(%
This is four PHRED encoded FASTQ entries originally from an NCBI source
(given the read length of 36, these are probably Solexa Illumina reads where
the quality has been mapped onto the PHRED values).
This example has been edited to illustrate some of the nasty things allowed
in the FASTQ format. Firstly, on the "+" lines most but not all of the
(redundant) identifiers are omitted. In real files it is likely that all or
none of these extra identifiers will be present.
Secondly, while the first three sequences have been shown without line
breaks, the last has been split over multiple lines. In real files any line
breaks are likely to be consistent.
Thirdly, some of the quality string lines start with an "@" character. For
the second record this is unavoidable. However for the fourth sequence this
only happens because its quality string is split over two lines. A naive
parser could wrongly treat any line starting with an "@" as the beginning of
a new sequence! This code copes with this possible ambiguity by keeping
track of the length of the sequence which gives the expected length of the
quality string.
Using this tricky example file as input, this short bit of code demonstrates
what this parsing function would return:
>>> with open("Quality/tricky.fastq", "rU") as handle:
... for (title, sequence, quality) in FastqGeneralIterator(handle):
... print(title)
... print("%s %s" % (sequence, quality))
...
071113_EAS56_0053:1:1:998:236
TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
071113_EAS56_0053:1:1:153:10
TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
071113_EAS56_0053:1:3:990:501
TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
Finally we note that some sources state that the quality string should
start with "!" (which using the PHRED mapping means the first letter always
has a quality score of zero). This rather restrictive rule is not widely
observed, so is therefore ignored here. One plus point about this "!" rule
is that (provided there are no line breaks in the quality sequence) it
would prevent the above problem with the "@" character.
FastqIlluminaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
The optional arguments are the same as those for the FastqPhredIterator.
For each sequence in Illumina 1.3+ FASTQ files there is a matching string
encoding PHRED integer qualities using ASCII values with an offset of 64.
>>> from Bio import SeqIO
>>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina")
>>> print("%s %s" % (record.id, record.seq))
Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
>>> max(record.letter_annotations["phred_quality"])
40
>>> min(record.letter_annotations["phred_quality"])
0
NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
with an ASCII offset of 64. They are approximately equal but only for high
quality reads. If you have an old Solexa/Illumina file with negative
Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
>>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina")
Traceback (most recent call last):
...
ValueError: Invalid character in quality string
NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
FastqPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
Generator function to iterate over FASTQ records (as SeqRecord objects).
- handle - input file
- alphabet - optional alphabet
- title2ids - A function that, when given the title line from the FASTQ
file (without the beginning >), will return the id, name and
description (in that order) for the record as a tuple of
strings. If this is not given, then the entire title line
will be used as the description, and the first word as the
id and name.
Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
For each sequence in a (Sanger style) FASTQ file there is a matching string
encoding the PHRED qualities (integers between 0 and about 90) using ASCII
values with an offset of 33.
For example, consider a file containing three short reads::
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
string encoding the PHRED qualities using a ASCII values with an offset of
33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
Using this module directly you might run:
>>> with open("Quality/example.fastq", "rU") as handle:
... for record in FastqPhredIterator(handle):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
Typically however, you would call this via Bio.SeqIO instead with "fastq"
(or "fastq-sanger") as the format:
>>> from Bio import SeqIO
>>> with open("Quality/example.fastq", "rU") as handle:
... for record in SeqIO.parse(handle, "fastq"):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
If you want to look at the qualities, they are record in each record's
per-letter-annotation dictionary as a simple list of integers:
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
FastqSolexaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
The optional arguments are the same as those for the FastqPhredIterator.
For each sequence in Solexa/Illumina FASTQ files there is a matching string
encoding the Solexa integer qualities using ASCII values with an offset
of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
will NOT perform any automatic conversion when loading.
NOTE - This file format is used by the OLD versions of the Solexa/Illumina
pipeline. See also the FastqIlluminaIterator function for the NEW version.
For example, consider a file containing these five records::
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ
Using this module directly you might run:
>>> with open("Quality/solexa_example.fastq", "rU") as handle:
... for record in FastqSolexaIterator(handle):
... print("%s %s" % (record.id, record.seq))
SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
Typically however, you would call this via Bio.SeqIO instead with
"fastq-solexa" as the format:
>>> from Bio import SeqIO
>>> with open("Quality/solexa_example.fastq", "rU") as handle:
... for record in SeqIO.parse(handle, "fastq-solexa"):
... print("%s %s" % (record.id, record.seq))
SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
If you want to look at the qualities, they are recorded in each record's
per-letter-annotation dictionary as a simple list of integers:
>>> print(record.letter_annotations["solexa_quality"])
[25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
These scores aren't very good, but they are high enough that they map
almost exactly onto PHRED scores:
>>> print("%0.2f" % phred_quality_from_solexa(25))
25.01
Let's look at faked example read which is even worse, where there are
more noticeable differences between the Solexa and PHRED scores::
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
Again, you would typically use Bio.SeqIO to read this file in (rather than
calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
contain thousands of reads, so you would normally use Bio.SeqIO.parse()
as shown above. This example has only as one entry, so instead we can
use the Bio.SeqIO.read() function:
>>> from Bio import SeqIO
>>> with open("Quality/solexa_faked.fastq", "rU") as handle:
... record = SeqIO.read(handle, "fastq-solexa")
>>> print("%s %s" % (record.id, record.seq))
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print(record.letter_annotations["solexa_quality"])
[40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
These quality scores are so low that when converted from the Solexa scheme
into PHRED scores they look quite different:
>>> print("%0.2f" % phred_quality_from_solexa(-1))
2.54
>>> print("%0.2f" % phred_quality_from_solexa(-5))
1.19
Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
method to output the record(s):
>>> print(record.format("fastq-solexa"))
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
<BLANKLINE>
Note this output is slightly different from the input file as Biopython
has left out the optional repetition of the sequence identifier on the "+"
line. If you want the to use PHRED scores, use "fastq" or "qual" as the
output format instead, and Biopython will do the conversion for you:
>>> print(record.format("fastq"))
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
<BLANKLINE>
>>> print(record.format("qual"))
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1 1
<BLANKLINE>
As shown above, the poor quality Solexa reads have been mapped to the
equivalent PHRED score (e.g. -5 to 1 as shown earlier).
PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=SingleLetterAlphabet(), title2ids=None)
Iterate over matched FASTA and QUAL files as SeqRecord objects.
For example, consider this short QUAL file with PHRED quality scores::
>EAS54_6_R1_2_1_413_324
26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
26 26 26 23 23
>EAS54_6_R1_2_1_540_792
26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
26 18 26 23 18
>EAS54_6_R1_2_1_443_348
26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
24 18 18 18 18
And a matching FASTA file::
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
You can parse these separately using Bio.SeqIO with the "qual" and
"fasta" formats, but then you'll get a group of SeqRecord objects with
no sequence, and a matching group with the sequence but not the
qualities. Because it only deals with one input file handle, Bio.SeqIO
can't be used to read the two files together - but this function can!
For example,
>>> with open("Quality/example.fasta", "rU") as f:
... with open("Quality/example.qual", "rU") as q:
... for record in PairedFastaQualIterator(f, q):
... print("%s %s" % (record.id, record.seq))
...
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
As with the FASTQ or QUAL parsers, if you want to look at the qualities,
they are in each record's per-letter-annotation dictionary as a simple
list of integers:
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
If you have access to data as a FASTQ format file, using that directly
would be simpler and more straight forward. Note that you can easily use
this function to convert paired FASTA and QUAL files into FASTQ files:
>>> from Bio import SeqIO
>>> with open("Quality/example.fasta", "rU") as f:
... with open("Quality/example.qual", "rU") as q:
... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq")
...
3
And don't forget to clean up the temp file if you don't need it anymore:
>>> import os
>>> os.remove("Quality/temp.fastq")
QualPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
For QUAL files which include PHRED quality scores, but no sequence.
For example, consider this short QUAL file::
>EAS54_6_R1_2_1_413_324
26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
26 26 26 23 23
>EAS54_6_R1_2_1_540_792
26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
26 18 26 23 18
>EAS54_6_R1_2_1_443_348
26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
24 18 18 18 18
Using this module directly you might run:
>>> with open("Quality/example.qual", "rU") as handle:
... for record in QualPhredIterator(handle):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 ?????????????????????????
EAS54_6_R1_2_1_540_792 ?????????????????????????
EAS54_6_R1_2_1_443_348 ?????????????????????????
Typically however, you would call this via Bio.SeqIO instead with "qual"
as the format:
>>> from Bio import SeqIO
>>> with open("Quality/example.qual", "rU") as handle:
... for record in SeqIO.parse(handle, "qual"):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 ?????????????????????????
EAS54_6_R1_2_1_540_792 ?????????????????????????
EAS54_6_R1_2_1_443_348 ?????????????????????????
Becase QUAL files don't contain the sequence string itself, the seq
property is set to an UnknownSeq object. As no alphabet was given, this
has defaulted to a generic single letter alphabet and the character "?"
used.
By specifying a nucleotide alphabet, "N" is used instead:
>>> from Bio import SeqIO
>>> from Bio.Alphabet import generic_dna
>>> with open("Quality/example.qual", "rU") as handle:
... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
... print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
However, the quality scores themselves are available as a list of integers
in each record's per-letter-annotation:
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
You can still slice one of these SeqRecord objects with an UnknownSeq:
>>> sub_record = record[5:10]
>>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"]))
EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
As of Biopython 1.59, this parser will accept files with negatives quality
scores but will replace them with the lowest possible PHRED score of zero.
This will trigger a warning, previously it raised a ValueError exception.
log(...)
log(x[, base])
Return the logarithm of x to the given base.
If the base not specified, returns the natural logarithm (base e) of x.
phred_quality_from_solexa(solexa_quality)
Convert a Solexa quality (which can be negative) to a PHRED quality.
PHRED and Solexa quality scores are both log transformations of a
probality of error (high score = low probability of error). This function
takes a Solexa score, transforms it back to a probability of error, and
then re-expresses it as a PHRED score. This assumes the error estimates
are equivalent.
The underlying formulas are given in the documentation for the sister
function solexa_quality_from_phred, in this case the operation is::
phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
This will return a floating point number, it is up to you to round this to
the nearest integer if appropriate. e.g.
>>> print("%0.2f" % round(phred_quality_from_solexa(80), 2))
80.00
>>> print("%0.2f" % round(phred_quality_from_solexa(20), 2))
20.04
>>> print("%0.2f" % round(phred_quality_from_solexa(10), 2))
10.41
>>> print("%0.2f" % round(phred_quality_from_solexa(0), 2))
3.01
>>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2))
1.19
Note that a solexa_quality less then -5 is not expected, will trigger a
warning, but will still be converted as per the logarithmic mapping
(giving a number between 0 and 1.19 back).
As a special case where None is used for a "missing value", None is
returned:
>>> print(phred_quality_from_solexa(None))
None
solexa_quality_from_phred(phred_quality)
Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
PHRED and Solexa quality scores are both log transformations of a
probality of error (high score = low probability of error). This function
takes a PHRED score, transforms it back to a probability of error, and
then re-expresses it as a Solexa score. This assumes the error estimates
are equivalent.
How does this work exactly? Well the PHRED quality is minus ten times the
base ten logarithm of the probability of error::
phred_quality = -10*log(error,10)
Therefore, turning this round::
error = 10 ** (- phred_quality / 10)
Now, Solexa qualities use a different log transformation::
solexa_quality = -10*log(error/(1-error),10)
After substitution and a little manipulation we get::
solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
However, real Solexa files use a minimum quality of -5. This does have a
good reason - a random base call would be correct 25% of the time,
and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
Taken literally, this logarithic formula would map a PHRED quality of zero
to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
score of zero means a probability of error of one (i.e. the base call is
definitely wrong), which is worse than random! In practice, a PHRED quality
of zero usually means a default value, or perhaps random - and therefore
mapping it to the minimum Solexa score of -5 is reasonable.
In conclusion, we follow EMBOSS, and take this logarithmic formula but also
apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
quality of zero to -5.0 as well.
Note this function will return a floating point number, it is up to you to
round this to the nearest integer if appropriate. e.g.
>>> print("%0.2f" % round(solexa_quality_from_phred(80), 2))
80.00
>>> print("%0.2f" % round(solexa_quality_from_phred(50), 2))
50.00
>>> print("%0.2f" % round(solexa_quality_from_phred(20), 2))
19.96
>>> print("%0.2f" % round(solexa_quality_from_phred(10), 2))
9.54
>>> print("%0.2f" % round(solexa_quality_from_phred(5), 2))
3.35
>>> print("%0.2f" % round(solexa_quality_from_phred(4), 2))
1.80
>>> print("%0.2f" % round(solexa_quality_from_phred(3), 2))
-0.02
>>> print("%0.2f" % round(solexa_quality_from_phred(2), 2))
-2.33
>>> print("%0.2f" % round(solexa_quality_from_phred(1), 2))
-5.00
>>> print("%0.2f" % round(solexa_quality_from_phred(0), 2))
-5.00
Notice that for high quality reads PHRED and Solexa scores are numerically
equal. The differences are important for poor quality reads, where PHRED
has a minimum of zero but Solexa scores can be negative.
Finally, as a special case where None is used for a "missing value", None
is returned:
>>> print(solexa_quality_from_phred(None))
None
DATA
SANGER_SCORE_OFFSET = 33
SOLEXA_SCORE_OFFSET = 64
__docformat__ = 'restructuredtext en'
print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...
single_letter_alphabet = SingleLetterAlphabet()
FILE
/home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/QualityIO.py
Converting FASTA and QUAL files into FASTQ files¶
FASTQ files hold both sequences and their quality strings. FASTA files hold just sequences, while QUAL files hold just the qualities. Therefore a single FASTQ file can be converted to or from paired FASTA and QUAL files.
Going from FASTQ to FASTA is easy:
from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
Going from FASTQ to QUAL is also easy:
from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")
However, the reverse is a little more tricky. You can use
Bio.SeqIO.parse()
to iterate over the records in a single file,
but in this case we have two input files. There are several strategies
possible, but assuming that the two files are really paired the most
memory efficient way is to loop over both together. The code is a little
fiddly, so we provide a function called PairedFastaQualIterator
in
the Bio.SeqIO.QualityIO
module to do this. This takes two handles
(the FASTA file and the QUAL file) and returns a SeqRecord
iterator:
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")):
print(record)
This function will check that the FASTA and QUAL files are consistent
(e.g. the records are in the same order, and have the same sequence
length). You can combine this with the Bio.SeqIO.write()
function to
convert a pair of FASTA and QUAL files into a single FASTQ files:
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
handle = open("temp.fastq", "w") #w=write
records = PairedFastaQualIterator(open("example.fasta"), open("example.qual"))
count = SeqIO.write(records, handle, "fastq")
handle.close()
print("Converted %i records" % count)
Indexing a FASTQ file¶
FASTQ files are often very large, with millions of reads in them. Due to
the sheer amount of data, you can’t load all the records into memory at
once. This is why the examples above (filtering and trimming) iterate
over the file looking at just one SeqRecord
at a time.
However, sometimes you can’t use a big loop or an iterator - you may
need random access to the reads. Here the Bio.SeqIO.index()
function
may prove very helpful, as it allows you to access any read in the FASTQ
file by its name (see Section [sec:SeqIO-index]).
Again we’ll use the SRR020192.fastq
file from the ENA
(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz),
although this is actually quite a small FASTQ file with less than
\(50,000\) reads:
In [9]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz
!gzip -d SRR020192.fastq.gz
In [10]:
from Bio import SeqIO
fq_dict = SeqIO.index("SRR020192.fastq", "fastq")
len(fq_dict)
Out[10]:
41892
In [12]:
fq_dict["SRR020192.23186"].seq
Out[12]:
Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG', SingleLetterAlphabet())
When testing this on a FASTQ file with seven million reads, indexing took about a minute, but record access was almost instant.
The example in Section [sec:SeqIO-sort] show how you can use the
Bio.SeqIO.index()
function to sort a large FASTA file – this could
also be used on FASTQ files.
Converting SFF files¶
If you work with 454 (Roche) sequence data, you will probably have access to the raw data as a Standard Flowgram Format (SFF) file. This contains the sequence reads (called bases) with quality scores and the original flow information.
A common task is to convert from SFF to a pair of FASTA and QUAL files,
or to a single FASTQ file. These operations are trivial using the
Bio.SeqIO.convert()
function (see Section [sec:SeqIO-conversion]):
In [14]:
from Bio import SeqIO
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta")
Out[14]:
10
In [15]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual")
Out[15]:
10
In [16]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq")
Out[16]:
10
Remember the convert function returns the number of records, in this example just ten. This will give you the untrimmed reads, where the leading and trailing poor quality sequence or adaptor will be in lower case. If you want the trimmed reads (using the clipping information recorded within the SFF file) use this:
In [17]:
from Bio import SeqIO
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
Out[17]:
10
In [18]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
Out[18]:
10
In [19]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
Out[19]:
10
If you run Linux, you could ask Roche for a copy of their “off instrument” tools (often referred to as the Newbler tools). This offers an alternative way to do SFF to FASTA or QUAL conversion at the command line (but currently FASTQ output is not supported), e.g.
$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta
$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual
$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta
$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual
The way Biopython uses mixed case sequence strings to represent the trimming points deliberately mimics what the Roche tools do.
For more information on the Biopython SFF support, consult the built in help:
In [20]:
from Bio.SeqIO import SffIO
help(SffIO)
Help on module Bio.SeqIO.SffIO in Bio.SeqIO:
NAME
Bio.SeqIO.SffIO - Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format.
DESCRIPTION
SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for
Biomedical Research and the Wellcome Trust Sanger Institute. SFF was also used
as the native output format from early versions of Ion Torrent's PGM platform
as well. You are expected to use this module via the Bio.SeqIO functions under
the format name "sff" (or "sff-trim" as described below).
For example, to iterate over the records in an SFF file,
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
... print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
...
E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT...
E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA...
E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...
E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT...
E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA...
E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC...
E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA...
E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG...
E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC...
E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA...
Each SeqRecord object will contain all the annotation from the SFF file,
including the PHRED quality scores.
>>> print("%s %i" % (record.id, len(record)))
E3MFGYR02F7Z7G 219
>>> print("%s..." % record.seq[:10])
tcagAATCAT...
>>> print("%r..." % (record.letter_annotations["phred_quality"][:10]))
[22, 21, 23, 28, 26, 15, 12, 21, 28, 21]...
Notice that the sequence is given in mixed case, the central upper case region
corresponds to the trimmed sequence. This matches the output of the Roche
tools (and the 3rd party tool sff_extract) for SFF to FASTA.
>>> print(record.annotations["clip_qual_left"])
4
>>> print(record.annotations["clip_qual_right"])
134
>>> print(record.seq[:4])
tcag
>>> print("%s...%s" % (record.seq[4:20], record.seq[120:134]))
AATCATCCACTTTTTA...CAAAACACAAACAG
>>> print(record.seq[134:])
atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn
The annotations dictionary also contains any adapter clip positions
(usually zero), and information about the flows. e.g.
>>> len(record.annotations)
11
>>> print(record.annotations["flow_key"])
TCAG
>>> print(record.annotations["flow_values"][:10])
(83, 1, 128, 7, 4, 84, 6, 106, 3, 172)
>>> print(len(record.annotations["flow_values"]))
400
>>> print(record.annotations["flow_index"][:10])
(1, 2, 3, 2, 2, 0, 3, 2, 3, 3)
>>> print(len(record.annotations["flow_index"]))
219
Note that to convert from a raw reading in flow_values to the corresponding
homopolymer stretch estimate, the value should be rounded to the nearest 100:
>>> print("%r..." % [int(round(value, -2)) // 100
... for value in record.annotations["flow_values"][:10]])
...
[1, 0, 1, 0, 0, 1, 0, 1, 0, 2]...
If a read name is exactly 14 alphanumeric characters, the annotations
dictionary will also contain meta-data about the read extracted by
interpretting the name as a 454 Sequencing System "Universal" Accession
Number. Note that if a read name happens to be exactly 14 alphanumeric
characters but was not generated automatically, these annotation records
will contain nonsense information.
>>> print(record.annotations["region"])
2
>>> print(record.annotations["time"])
[2008, 1, 9, 16, 16, 0]
>>> print(record.annotations["coords"])
(2434, 1658)
As a convenience method, you can read the file with SeqIO format name "sff-trim"
instead of "sff" to get just the trimmed sequences (without any annotation
except for the PHRED quality scores and anything encoded in the read names):
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"):
... print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
...
E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC...
E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC...
E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...
E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT...
E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA...
E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC...
E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC...
E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC...
E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG...
E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT...
Looking at the final record in more detail, note how this differs to the
example above:
>>> print("%s %i" % (record.id, len(record)))
E3MFGYR02F7Z7G 130
>>> print("%s..." % record.seq[:10])
AATCATCCAC...
>>> print("%r..." % record.letter_annotations["phred_quality"][:10])
[26, 15, 12, 21, 28, 21, 36, 28, 27, 27]...
>>> len(record.annotations)
3
>>> print(record.annotations["region"])
2
>>> print(record.annotations["coords"])
(2434, 1658)
>>> print(record.annotations["time"])
[2008, 1, 9, 16, 16, 0]
You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF
reads into a FASTQ file (or a FASTA file and a QUAL file), e.g.
>>> from Bio import SeqIO
>>> try:
... from StringIO import StringIO # Python 2
... except ImportError:
... from io import StringIO # Python 3
...
>>> out_handle = StringIO()
>>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff",
... out_handle, "fastq")
...
>>> print("Converted %i records" % count)
Converted 10 records
The output FASTQ file would start like this:
>>> print("%s..." % out_handle.getvalue()[:50])
@E3MFGYR02JWQ7T
tcagGGTCTACATGTTGGTTAACCCGTACTGATT...
Bio.SeqIO.index() provides memory efficient random access to the reads in an
SFF file by name. SFF files can include an index within the file, which can
be read in making this very fast. If the index is missing (or in a format not
yet supported in Biopython) the file is indexed by scanning all the reads -
which is a little slower. For example,
>>> from Bio import SeqIO
>>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff")
>>> record = reads["E3MFGYR02JHD4H"]
>>> print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...
>>> reads.close()
Or, using the trimmed reads:
>>> from Bio import SeqIO
>>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim")
>>> record = reads["E3MFGYR02JHD4H"]
>>> print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...
>>> reads.close()
You can also use the Bio.SeqIO.write() function with the "sff" format. Note
that this requires all the flow information etc, and thus is probably only
useful for SeqRecord objects originally from reading another SFF file (and
not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim").
As an example, let's pretend this example SFF file represents some DNA which
was pre-amplified with a PCR primers AAAGANNNNN. The following script would
produce a sub-file containing all those reads whose post-quality clipping
region (i.e. the sequence after trimming) starts with AAAGA exactly (the non-
degenerate bit of this pretend primer):
>>> from Bio import SeqIO
>>> records = (record for record in
... SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
... if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA"))
...
>>> count = SeqIO.write(records, "temp_filtered.sff", "sff")
>>> print("Selected %i records" % count)
Selected 2 records
Of course, for an assembly you would probably want to remove these primers.
If you want FASTA or FASTQ output, you could just slice the SeqRecord. However,
if you want SFF output we have to preserve all the flow information - the trick
is just to adjust the left clip position!
>>> from Bio import SeqIO
>>> def filter_and_trim(records, primer):
... for record in records:
... if record.seq[record.annotations["clip_qual_left"]:].startswith(primer):
... record.annotations["clip_qual_left"] += len(primer)
... yield record
...
>>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
>>> count = SeqIO.write(filter_and_trim(records, "AAAGA"),
... "temp_filtered.sff", "sff")
...
>>> print("Selected %i records" % count)
Selected 2 records
We can check the results, note the lower case clipped region now includes the "AAAGA"
sequence:
>>> for record in SeqIO.parse("temp_filtered.sff", "sff"):
... print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
...
E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC...
E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA...
>>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"):
... print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
...
E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG...
E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG...
>>> import os
>>> os.remove("temp_filtered.sff")
For a description of the file format, please see the Roche manuals and:
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats
CLASSES
Bio.SeqIO.Interfaces.SequenceWriter(builtins.object)
SffWriter
class SffWriter(Bio.SeqIO.Interfaces.SequenceWriter)
| SFF file writer.
|
| Method resolution order:
| SffWriter
| Bio.SeqIO.Interfaces.SequenceWriter
| builtins.object
|
| Methods defined here:
|
| __init__(self, handle, index=True, xml=None)
| Creates the writer object.
|
| - handle - Output handle, ideally in binary write mode.
| - index - Boolean argument, should we try and write an index?
| - xml - Optional string argument, xml manifest to be recorded in the index
| block (see function ReadRocheXmlManifest for reading this data).
|
| write_file(self, records)
| Use this to write an entire file containing the given records.
|
| write_header(self)
|
| write_record(self, record)
| Write a single additional record to the output file.
|
| This assumes the header has been done.
|
| ----------------------------------------------------------------------
| Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| clean(self, text)
| Use this to avoid getting newlines in the output.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
FUNCTIONS
ReadRocheXmlManifest(handle)
Reads any Roche style XML manifest data in the SFF "index".
The SFF file format allows for multiple different index blocks, and Roche
took advantage of this to define their own index block which also embeds
an XML manifest string. This is not a publically documented extension to
the SFF file format, this was reverse engineered.
The handle should be to an SFF file opened in binary mode. This function
will use the handle seek/tell functions and leave the handle in an
arbitrary location.
Any XML manifest found is returned as a Python string, which you can then
parse as appropriate, or reuse when writing out SFF files with the
SffWriter class.
Returns a string, or raises a ValueError if an Roche manifest could not be
found.
SffIterator(handle, alphabet=DNAAlphabet(), trim=False)
Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).
- handle - input file, an SFF file, e.g. from Roche 454 sequencing.
This must NOT be opened in universal read lines mode!
- alphabet - optional alphabet, defaults to generic DNA.
- trim - should the sequences be trimmed?
The resulting SeqRecord objects should match those from a paired FASTA
and QUAL file converted from the SFF file using the Roche 454 tool
ssfinfo. i.e. The sequence will be mixed case, with the trim regions
shown in lower case.
This function is used internally via the Bio.SeqIO functions:
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
... print("%s %i" % (record.id, len(record)))
...
E3MFGYR02JWQ7T 265
E3MFGYR02JA6IL 271
E3MFGYR02JHD4H 310
E3MFGYR02GFKUC 299
E3MFGYR02FTGED 281
E3MFGYR02FR9G7 261
E3MFGYR02GAZMS 278
E3MFGYR02HHZ8O 221
E3MFGYR02GPGB1 269
E3MFGYR02F7Z7G 219
You can also call it directly:
>>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
... for record in SffIterator(handle):
... print("%s %i" % (record.id, len(record)))
...
E3MFGYR02JWQ7T 265
E3MFGYR02JA6IL 271
E3MFGYR02JHD4H 310
E3MFGYR02GFKUC 299
E3MFGYR02FTGED 281
E3MFGYR02FR9G7 261
E3MFGYR02GAZMS 278
E3MFGYR02HHZ8O 221
E3MFGYR02GPGB1 269
E3MFGYR02F7Z7G 219
Or, with the trim option:
>>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
... for record in SffIterator(handle, trim=True):
... print("%s %i" % (record.id, len(record)))
...
E3MFGYR02JWQ7T 260
E3MFGYR02JA6IL 265
E3MFGYR02JHD4H 292
E3MFGYR02GFKUC 295
E3MFGYR02FTGED 277
E3MFGYR02FR9G7 256
E3MFGYR02GAZMS 271
E3MFGYR02HHZ8O 150
E3MFGYR02GPGB1 221
E3MFGYR02F7Z7G 130
DATA
__docformat__ = 'restructuredtext en'
print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...
FILE
/home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/SffIO.py
Identifying open reading frames¶
A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.
Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.
To show how you might approach this with Biopython, we’ll need a
sequence to search, and as an example we’ll again use the bacterial
plasmid – although this time we’ll start with a plain FASTA file with no
pre-marked genes:
`NC_005816.fna
<http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna>`__.
This is a bacterial sequence, so we’ll want to use NCBI codon table 11
(see Section [sec:translation] about translation).
In [22]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_pro_len = 100
Here is a neat trick using the Seq
object’s split
method to get
a list of all the possible ORF translations in the six reading frames:
In [24]:
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
length = 3 * ((len(record)-frame) // 3) #Multiple of three
for pro in nuc[frame:frame+length].translate(table).split("*"):
if len(pro) >= min_pro_len:
print("%s...%s - length %i, strand %i, frame %i"
% (pro[:30], pro[-3:], len(pro), strand, frame))
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2
Note that here we are counting the frames from the 5’ end (start) of each strand. It is sometimes easier to always count from the 5’ end (start) of the forward strand.
You could easily edit the above loop based code to build up a list of the candidate proteins, or convert this to a list comprehension. Now, one thing this code doesn’t do is keep track of where the proteins are.
You could tackle this in several ways. For example, the following code tracks the locations in terms of the protein counting, and converts back to the parent sequence by multiplying by three, then adjusting for the frame and strand:
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb","genbank")
table = 11
min_pro_len = 100
def find_orfs_with_trans(seq, trans_table, min_protein_length):
answer = []
seq_len = len(seq)
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
for frame in range(3):
trans = str(nuc[frame:].translate(trans_table))
trans_len = len(trans)
aa_start = 0
aa_end = 0
while aa_start < trans_len:
aa_end = trans.find("*", aa_start)
if aa_end == -1:
aa_end = trans_len
if aa_end-aa_start >= min_protein_length:
if strand == 1:
start = frame+aa_start*3
end = min(seq_len,frame+aa_end*3+3)
else:
start = seq_len-frame-aa_end*3-3
end = seq_len-frame-aa_start*3
answer.append((start, end, strand,
trans[aa_start:aa_end]))
aa_start = aa_end+1
answer.sort()
return answer
orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
print("%s...%s - length %i, strand %i, %i:%i" \
% (pro[:30], pro[-3:], len(pro), strand, start, end))
And the output:
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609
If you comment out the sort statement, then the protein sequences will be shown in the same order as before, so you can check this is doing the same thing. Here we have sorted them by location to make it easier to compare to the actual annotation in the GenBank file (as visualised in Section [sec:gd_nice_example]).
If however all you want to find are the locations of the open reading
frames, then it is a waste of time to translate every possible codon,
including doing the reverse complement to search the reverse strand too.
All you need to do is search for the possible stop codons (and their
reverse complements). Using regular expressions is an obvious approach
here (see the Python module re
). These are an extremely powerful
(but rather complex) way of describing search strings, which are
supported in lots of programming languages and also command line tools
like grep
as well). You can find whole books about this topic!
Sequence parsing plus simple plots¶
This section shows some more examples of sequence parsing, using the
Bio.SeqIO
module described in Chapter [chapter:Bio.SeqIO], plus the
Python library matplotlib’s pylab
plotting interface (see the
matplotlib website for a
tutorial). Note that to follow
these examples you will need matplotlib installed - but without it you
can still try the data parsing bits.
Histogram of sequence lengths¶
There are lots of times when you might want to visualise the distribution of sequence lengths in a dataset – for example the range of contig sizes in a genome assembly project. In this example we’ll reuse our orchid FASTA file ls_orchid.fasta which has only 94 sequences.
First of all, we will use Bio.SeqIO
to parse the FASTA file and
compile a list of all the sequence lengths. You could do this with a for
loop, but I find a list comprehension more pleasing:
In [28]:
from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("data/ls_orchid.fasta", "fasta")]
len(sizes), min(sizes), max(sizes)
Out[28]:
(94, 572, 789)
In [29]:
sizes
Out[29]:
[740,
753,
748,
744,
733,
718,
730,
704,
740,
709,
700,
726,
753,
699,
658,
752,
726,
765,
755,
742,
762,
745,
750,
731,
741,
740,
727,
711,
743,
727,
757,
770,
767,
759,
750,
788,
774,
789,
688,
719,
743,
737,
728,
740,
696,
732,
731,
735,
720,
740,
629,
572,
587,
700,
636,
716,
592,
716,
733,
626,
737,
740,
574,
594,
610,
730,
641,
702,
733,
738,
736,
732,
745,
744,
738,
739,
740,
745,
695,
745,
743,
730,
706,
744,
742,
694,
712,
715,
688,
784,
721,
703,
744,
592]
Now that we have the lengths of all the genes (as a list of integers), we can use the matplotlib histogram function to display it.
from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
% (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()
That should pop up a new window containing the following graph:
That should pop up a new window containing the graph shown in Figure [fig:seq-len-hist].
Notice that most of these orchid sequences are about \(740\) bp long, and there could be two distinct classes of sequence here with a subset of shorter sequences.
Tip: Rather than using pylab.show()
to show the plot in a window,
you can also use pylab.savefig(...)
to save the figure to a file
(e.g. as a PNG or PDF).
Plot of sequence GC%¶
Another easily calculated quantity of a nucleotide sequence is the GC%. You might want to look at the GC% of all the genes in a bacterial genome for example, and investigate any outliers which could have been recently acquired by horizontal gene transfer. Again, for this example we’ll reuse our orchid FASTA file ls_orchid.fasta.
First of all, we will use Bio.SeqIO
to parse the FASTA file and
compile a list of all the GC percentages. Again, you could do this with
a for loop, but I prefer this:
from Bio import SeqIO
from Bio.SeqUtils import GC
gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))
Having read in each sequence and calculated the GC%, we then sorted them into ascending order. Now we’ll take this list of floating point values and plot them with matplotlib:
import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
% (len(gc_values),min(gc_values),max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()
As in the previous example, that should pop up a new window containing a graph:
As in the previous example, that should pop up a new window with the graph shown in Figure [fig:seq-gc-plot].
If you tried this on the full set of genes from one organism, you’d probably get a much smoother plot than this.
Nucleotide dot plots¶
A dot plot is a way of visually comparing two nucleotide sequences for similarity to each other. A sliding window is used to compare short sub-sequences to each other, often with a mis-match threshold. Here for simplicity we’ll only look for perfect matches (shown in black
in Figure [fig:nuc-dot-plot]).
in the plot below).
To start off, we’ll need two sequences. For the sake of argument, we’ll just take the first two from our orchid FASTA file ls_orchid.fasta:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
rec_one = next(record_iterator)
rec_two = next(record_iterator)
handle.close()
We’re going to show two approaches. Firstly, a simple naive implementation which compares all the window sized sub-sequences to each other to compiles a similarity matrix. You could construct a matrix or array object, but here we just use a list of lists of booleans created with a nested list comprehension:
window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
data = [[(seq_one[i:i+window] <> seq_two[j:j+window]) \
for j in range(len(seq_one)-window)] \
for i in range(len(seq_two)-window)]
Note that we have not checked for reverse complement matches here. Now
we’ll use the matplotlib’s pylab.imshow()
function to display this
data, first requesting the gray color scheme so this is done in black
and white:
import pylab
pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
That should pop up a new window containing a graph like this:
That should pop up a new window showing the graph in Figure [fig:nuc-dot-plot].
As you might have expected, these two sequences are very similar with a partial line of window sized matches along the diagonal. There are no off diagonal matches which would be indicative of inversions or other interesting events.
The above code works fine on small examples, but there are two problems
applying this to larger sequences, which we will address below. First
off all, this brute force approach to the all against all comparisons is
very slow. Instead, we’ll compile dictionaries mapping the window sized
sub-sequences to their locations, and then take the set intersection to
find those sub-sequences found in both sequences. This uses more memory,
but is much faster. Secondly, the pylab.imshow()
function is
limited in the size of matrix it can display. As an alternative, we’ll
use the pylab.scatter()
function.
We start by creating dictionaries mapping the window-sized sub-sequences to locations:
window = 7
dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
(str(rec_two.seq).upper(), dict_two)]:
for i in range(len(seq)-window):
section = seq[i:i+window]
try:
section_dict[section].append(i)
except KeyError:
section_dict[section] = [i]
#Now find any sub-sequences found in both sequences
#(Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print("%i unique matches" % len(matches))
In order to use the pylab.scatter()
we need separate lists for the
\(x\) and \(y\) co-ordinates:
#Create lists of x and y co-ordinates for scatter plot
x = []
y = []
for section in matches:
for i in dict_one[section]:
for j in dict_two[section]:
x.append(i)
y.append(j)
We are now ready to draw the revised dot plot as a scatter plot:
import pylab
pylab.cla() #clear any prior graph
pylab.gray()
pylab.scatter(x,y)
pylab.xlim(0, len(rec_one)-window)
pylab.ylim(0, len(rec_two)-window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
That should pop up a new window containing a graph like this:
That should pop up a new window showing the graph in Figure [fig:nuc-dot-plot-scatter].
Personally I find this second plot much easier to read! Again note that we have not checked for reverse complement matches here – you could extend this example to do this, and perhaps plot the forward matches in one color and the reverse matches in another.
Plotting the quality scores of sequencing read data¶
If you are working with second generation sequencing data, you may want
to try plotting the quality data. Here is an example using two FASTQ
files containing paired end reads, SRR001666_1.fastq
for the forward
reads, and SRR001666_2.fastq
for the reverse reads. These were
downloaded from the ENA sequence read archive FTP site
(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz
and
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz),
and are from E. coli – see
http://www.ebi.ac.uk/ena/data/view/SRR001666 for details. In the
following code the pylab.subplot(...)
function is used in order to
show the forward and reverse qualities on two subplots, side by side.
There is also a little bit of code to only plot the first fifty reads.
import pylab
from Bio import SeqIO
for subfigure in [1,2]:
filename = "SRR001666_%i.fastq" % subfigure
pylab.subplot(1, 2, subfigure)
for i,record in enumerate(SeqIO.parse(filename, "fastq")):
if i >= 50 : break #trick!
pylab.plot(record.letter_annotations["phred_quality"])
pylab.ylim(0,45)
pylab.ylabel("PHRED quality score")
pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
print("Done")
You should note that we are using the Bio.SeqIO
format name
fastq
here because the NCBI has saved these reads using the standard
Sanger FASTQ format with PHRED scores. However, as you might guess from
the read lengths, this data was from an Illumina Genome Analyzer and was
probably originally in one of the two Solexa/Illumina FASTQ variant file
formats instead.
This example uses the pylab.savefig(...)
function instead of
pylab.show(...)
, but as mentioned before both are useful.
The result is shown in Figure [fig:paired-end-qual-plot].
Here is the result:
Dealing with alignments¶
This section can been seen as a follow on to Chapter [chapter:Bio.AlignIO].
Calculating summary information¶
Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we’ve tried to separate out the functionality into separate classes, which act on the alignment.
Getting ready to calculate summary information about an object is quick
to do. Let’s say we’ve got an alignment object called alignment
, for
example read in using Bio.AlignIO.read(...)
as described in
Chapter [chapter:Bio.AlignIO]. All we need to do to get an object that
will calculate summary information is:
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)
The summary_align
object is very useful, and will do the following
neat things for you:
- Calculate a quick consensus sequence – see section [sec:consensus]
- Get a position specific score matrix for the alignment – see section [sec:pssm]
- Calculate the information content for the alignment – see section [sec:getting_info_content]
- Generate information on substitutions in the alignment – section [sec:sub_matrix] details using this to generate a substitution matrix.
Calculating a quick consensus sequence¶
The SummaryInfo
object, described in section [sec:summary_info],
provides functionality to calculate a quick consensus of an alignment.
Assuming we’ve got a SummaryInfo
object called summary_align
we
can calculate a consensus by doing:
consensus = summary_align.dumb_consensus()
As the name suggests, this is a really simple consensus calculator, and
will just add up all of the residues at each point in the consensus, and
if the most common value is higher than some threshold value will add
the common residue to the consensus. If it doesn’t reach the threshold,
it adds an ambiguity character to the consensus. The returned consensus
object is Seq object whose alphabet is inferred from the alphabets of
the sequences making up the consensus. So doing a print consensus
would give:
consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can adjust how dumb_consensus
works by passing optional
parameters:
- the threshold
- This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is \(0.7\) (meaning \(70\%\)).
- the ambiguous character
- This is the ambiguity character to use. The default is ’N’.
- the consensus alphabet
- This is the alphabet to use for the consensus sequence. If an alphabet is not specified than we will try to guess the alphabet based on the alphabets of the sequences in the alignment.
Position Specific Score Matrices¶
Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment,
GTATC
AT--C
CTGTC
the PSSM is:
G A T C
G 1 1 0 1
T 0 0 3 0
A 1 1 0 0
T 0 0 2 0
C 0 0 0 3
Let’s assume we’ve got an alignment object called c_align
. To get a
PSSM with the consensus sequence along the side we first get a summary
object and calculate the consensus sequence:
summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()
Now, we want to make the PSSM, but ignore any N
ambiguity residues
when calculating this:
my_pssm = summary_align.pos_specific_score_matrix(consensus,
chars_to_ignore = ['N'])
Two notes should be made about this:
- To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.
- The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do:
second_seq = alignment.get_seq_by_num(1)
my_pssm = summary_align.pos_specific_score_matrix(second_seq
chars_to_ignore = ['N'])
The command above returns a PSSM
object. To print out the PSSM as
shown above, we simply need to do a print(my_pssm)
, which gives:
A C G T
T 0.0 0.0 0.0 7.0
A 7.0 0.0 0.0 0.0
T 0.0 0.0 0.0 7.0
A 7.0 0.0 0.0 0.0
C 0.0 7.0 0.0 0.0
A 7.0 0.0 0.0 0.0
T 0.0 0.0 0.0 7.0
T 1.0 0.0 0.0 6.0
...
You can access any element of the PSSM by subscripting like
your_pssm[sequence_number][residue_count_name]
. For instance, to get
the counts for the ’A’ residue in the second element of the above PSSM
you would do:
In [31]:
print(my_pssm[1]["A"])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-31-9d60637622f7> in <module>()
----> 1 print(my_pssm[1]["A"])
NameError: name 'my_pssm' is not defined
The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.
Information Content¶
A potentially useful measure of evolutionary conservation is the information content of a sequence.
A useful introduction to information theory targeted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:
where:
- \(IC_{j}\) – The information content for the \(j\)-th column in an alignment.
- \(N_{a}\) – The number of letters in the alphabet.
- \(P_{ij}\) – The frequency of a particular letter \(i\) in the \(j\)-th column (i. e. if G occurred 3 out of 6 times in an aligment column, this would be 0.5)
- \(Q_{i}\) – The expected frequency of a letter \(i\). This is an optional argument, usage of which is left at the user’s discretion. By default, it is automatically assigned to \(0.05 = 1/20\) for a protein alphabet, and \(0.25 = 1/4\) for a nucleic acid alphabet. This is for geting the information content without any assumption of prior distributions. When assuming priors, or when using a non-standard alphabet, you should supply the values for \(Q_{i}\).
Well, now that we have an idea what information content is being calculated in Biopython, let’s look at how to get it for a particular region of the alignment.
First, we need to use our alignment to get an alignment summary object,
which we’ll assume is called summary_align
(see
section [sec:summary_info]) for instructions on how to get this. Once
we’ve got this object, calculating the information content for a region
is as easy as:
info_content = summary_align.information_content(5, 30,
chars_to_ignore = ['N'])
Wow, that was much easier then the formula above made it look! The
variable info_content
now contains a float value specifying the
information content over the specified region (from 5 to 30 of the
alignment). We specifically ignore the ambiguity residue ’N’ when
calculating the information content, since this value is not included in
our alphabet (so we shouldn’t be interested in looking at it!).
As mentioned above, we can also calculate relative information content by supplying the expected frequencies:
expect_freq = {
'A' : .3,
'G' : .2,
'T' : .3,
'C' : .2}
The expected should not be passed as a raw dictionary, but instead by
passed as a SubsMat.FreqTable
object (see section [sec:freq_table]
for more information about FreqTables). The FreqTable object provides a
standard for associating the dictionary with an Alphabet, similar to how
the Biopython Seq class works.
To create a FreqTable object, from the frequency dictionary you just need to do:
from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable
e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
IUPAC.unambiguous_dna)
Now that we’ve got that, calculating the relative information content for our region of the alignment is as simple as:
info_content = summary_align.information_content(5, 30,
e_freq_table = e_freq_table,
chars_to_ignore = ['N'])
Now, info_content
will contain the relative information content over
the region in relation to the expected frequencies.
The value return is calculated using base 2 as the logarithm base in the
formula above. You can modify this by passing the parameter log_base
as the base you want:
info_content = summary_align.information_content(5, 30, log_base = 10,
chars_to_ignore = ['N'])
Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won’t reveal any mistakes made in coding this function!
Substitution Matrices¶
Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book “Biological Sequence Analysis” by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.
Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.
Using common substitution matrices¶
Creating your own substitution matrix from an alignment¶
A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done with protein alignments. In this example, we’ll first get a Biopython alignment object and then get a summary object to calculate info about the alignment. The file containing protein.aln (also available online here) contains the Clustalw alignment output.
In [33]:
from Bio import AlignIO
from Bio import Alphabet
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo
filename = "data/protein.aln"
alpha = Alphabet.Gapped(IUPAC.protein)
c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
summary_align = AlignInfo.SummaryInfo(c_align)
Sections [sec:align_clustal] and [sec:summary_info] contain more information on doing this.
Now that we’ve got our summary_align
object, we want to use it to
find out the number of times different residues substitute for each
other. To make the example more readable, we’ll focus on only amino
acids with polar charged side chains. Luckily, this can be done easily
when generating a replacement dictionary, by passing in all of the
characters that should be ignored. Thus we’ll create a dictionary of
replacements for only charged polar amino acids using:
In [34]:
replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
"M", "P", "F", "W", "S",
"T", "N", "Q", "Y", "C"])
This information about amino acid replacements is represented as a python dictionary which will look something like (the order can vary):
{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement Matrix (ARM):
In [35]:
from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)
With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i. e. a standard type Substitution Matrix):
In [36]:
my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following optional arguments:
exp_freq_table
– You can pass a table of expected frequencies for each alphabet. If supplied, this will be used instead of the passed accepted replacement matrix when calculate expected replacments.logbase
- The base of the logarithm taken to create the log odd matrix. Defaults to base 10.factor
- The factor to multiply each matrix entry by. This defaults to 10, which normally makes the matrix numbers easy to work with.round_digit
- The digit to round to in the matrix. This defaults to 0 (i. e. no digits).
Once you’ve got your log odds matrix, you can display it prettily using
the function print_mat
. Doing this on our created matrix gives:
In [37]:
my_lom.print_mat()
D 2
E -1 1
H -5 -4 3
K -10 -5 -4 1
R -4 -8 -4 -2 2
D E H K R